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Introduction 


With the emergence of the general-purpose supercomputer, it has become 
possible to perform accurate numerical simulations of a wide variety of com- 
plex, three-dimensional scientific problems. Yet there are many problems of 
interest which are so complex that, in order to simulate them numerically, 
supercomputers with computation rates and storage capacities orders of magni- 
tude larger than those of current supercomputers are needed. Furthermore, the 
computational requirements of such problems will continue to surpass the 
performance of general-purpose supercomputers for at least the next decade. 
As a result, increased attention has been directed towards an emerging class 
of parallel-processing supercomputers. 

One such parallel -processing supercomputer is the "Navier-Stokes 

Computer” (NSC) [1, 2], which is being developed by Daniel Nosenchuck and 
Michael Littman at Princeton University. The NSC consists of multiple (up to 
128) local memory parallel processors, called Nodes, interconnected in a 
hypercube network. Each Node has the power and speed of a Class VI super- 
computer, giving a projected, full 128 Node NSC a storage capacity in excess 
of 262 Gbytes, and a peak speed in excess of 61 GFLOPS. 

The computational speed and efficiency of the NSC are derived in part 
from a Nodal architecture which provides parallel operation of both the memory 
and the computational units. Parallel operation of the memory is attained by 
distributing the local memory of a Node over multiple interleaved memory 
planes, all of which may operate simultaneously. Multiple operand vectors may 
then be streamed from memory into a Reconfigurable Arithmetic and Logic Unit 


(RALU), while multiple result vectors are simultaneously routed from the RALU 
back to memory* The RALU architecture involves the interconnection of multi- 
ple high-speed floating point and logic processing units into single or multi- 
ple parallel pipelines. Since the RALU is dynamically reconf igurable, many 
different parallel pipeline configurations may be formed from the processing 
units. Efficient implementation of an algorithm on the NSC thus entails the 
effective utilization of not only the coarse grain parallelism associated with 
distributing the problem over the Nodes , but also of the fine grain parallel- 
ism associated with configuring the parallel pipelines and with routing multi- 
ple vectors between the RALU and memory. 

Although the NSC was originally designed for the numerical simulation of 
flows governed by the Navier-Stokes equations, it provides a means for effi- 
ciently solving most any problem for which the computation is numerically 
intensive, and for which the algorithm makes use of long vectors and may be 
parallelized with a coarse granularity using local rather than shared memory* 
The objective of this paper is to detail the procedures involved in implement- 
ing one such algorithm on the NSC* Particular attention is focused on proce- 
dures for mapping the computational grid into the Nodes, for configuring the 
RALU pipelines, and for allocating memory planes for storing variables such 
that multiple vectors are efficiently routed between memory and the RALU* The 
resulting analysis constitutes virtually an assembly language level descrip- 
tion of algorithm implementation on the NSC. 

The specific finite-difference algorithm considered herein was developed 
for the direct numerical simulation of laminar-turbulent transition in wall- 
bounded shear flows by solution of the incompressible Navier-Stokes equations. 
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Recent work in this area [3, 4, 5] indicates that numerical algorithms for 
performing this simulation are quite mature. However , it has not been 
possible to continue the simulations up through the onset of fully turbulent 
flow, since current supercomputers do not have the storage capacity and speed 
needed to satisfy the extreme resolution demands of the later transition 
stages. Furthermore current transition simulations must resort to using the 
parallel flow assumption, because the resolution demands of a true spatially 
developing flow are an order of magnitude more extreme. Hence, the numerical 
simulation of laminar-turbulent transition exemplifies the need for new super- 
computers like the NSC, which have the power and speed requisite for perform- 
ing these simulations. 

In formulating the algorithm, the Navier-Stokes equations are temporally 

discretized using a splitting technique in conjunction with low storage Runge- 

Kutta treatment of the advection term. Crank -Nicholson treatment of the dif- 
fusion term, and backward Euler treatment of the pressure term. The resulting 
algorithm requires the solution of a Helmholtz equation for each velocity 

component and of a Poisson equation for the pressure, at each Runge-Kutta 
stage. Solution of these equations, which constitutes the bulk of the compu- 
tational work, is analyzed for two iterative methods, namely, Red-Black SOR 
and ZEBRA 1. Red-Black SOR is a two-color explicit point method in which a 
point Jacobi type iteration is applied at alternating points, ZEBRA 1 is a 
two-color implicit line method in which line relaxation is applied along 

alternating lines. 

It should be emphasized that the purpose of this analysis is to detail 
the procedures involved in implementing a representative algorithm for simu- 
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la ting the laminar -turbulent transition problem on the NSC. The algorithm 
considered in this analysis was chosen for its simplicity, particularly in 
regard to the use of second-order discretization methods and single grid 
iterative methods. More suitable algorithms will surely use higher order 
discretization methods, and employ conjugate qradient or multigrid methods 
based on the single grid iterative methods considered here. However, the 
implementation procedures detailed herein are indicative of the procedures 
required for implementing the more complex algorithms on the NSC. 

In the two following sections, architectural details of the NSC and 
formulation of the algorithm are briefly discussed. Details of the procedures 
for iiqplementinq the algorithm on the NSC, and projected timing results for a 
particular problem, are presented in the subsequent section. Some concluding 
remarks are presented in the final section. 


Architectural Overview of the NSC 

Tfhe NSC is a multi-purpose parallel-processing supercomputer which is 
designed to perform numerical simulations of a wide variety of large, numeri- 
cally intensive, complex scientific problems. Rapid solution of these 
problems is attained through a global architecture which distributes the 
computations over a fairly small number of powerful local memory parallel 
processors, called Nodes. A broad overview of the global and Nodal architec- 
tures is presented below. 
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As the NSC is currently in the developmental stage, certain architectural 
details have not been finalized. For instance, tradeoffs between having the 
Nodal memory distributed over 8 memory planes rather than 16 memory planes (as 
assumed herein), are still under review. However, such alterations in the 
architectural details are likely to affect the algorithm implementation pro- 
cedures to only a limited extent. It should also be noted that the modular 
design of the NSC allows for relatively easy upgrading of the hardware compo- 
nents. Hence, the memory/speed characteristics of the final design may differ 
from the characteristics detailed herein, which are based on the utilization 
of mid-1986 technology. 

Global architecture. - The global architecture of the NSC involves the 
interconnection of the Nodes through two communication networks. The first 
network utilizes a global drop-line bus to link the entire Node array to a 
front-end computer. Although the global bus is primarily used to transfer 
data and commands between the front-end and the Nodes, it may also be used to 
transfer data between any two Nodes of the array. The second communication 
network involves the interconnection of the Nodes in a hypercube network 
[6]* Internode communication links for the hypercube network are implemented 
with fiber-optic transmission lines, providing data transmission rates orders 
of magnitude faster than those provided by the global bus. Consequently, 
during execution of the calculation ^.vCdures for a given algorithm, most (if 
not all) internode data transfers are routed through the hypercube network. A 
schematic of the global architecture, where a subset of the Nodes and a simple 
2-D nearest neighbor interconnect network are illustrated, is presented in 
figure 1 • 
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The front-end is a general-purpose computer which provides the operating 
environment for the NSC* In-line it is used to load data and commands to the 
Nodes and to monitor the Node array* Off-line it provides program develop- 
ment, work station support, and data analysis capabilities* 

Nodal architecture* - The Nodal architecture is designed to permit paral- 
lel operation of both the memory and the computational units, providing large 
throughput rates for a wide variety of computational procedures. Architec- 
turally central to this operation are the multiplane interleaved memory, the 
dynamically reconfigurable ALU (RALU), and the Memory-ALU-Switch Network 
(MASNET)* The interconnection of these devices is illustrated in figure 2* 
Computations are begun by streaming multiple operand vectors from the memory 
planes to the input ports of MASNET* The operand vectors are then routed 
through MASNET, which essentially is a 16x16 nonblocking switch, and enter the 
input ports of the RALU* Result vectors from the RALU are then routed back 
through MASNET and sent to the appropriate memory planes for storage* 

The multiplane interleaved memory consists of sixteen 128 Mbyte memory 
planes, giving each Node a local memory of 512 Mwords for 32-bit words* 
During a given process, each memory plane may be connected to either an input 
port or an output port of MASNET, but not both* Consequently, a memory plane 
cannot be used to store elements of a result vector while simultaneously being 
accessed for elements of an operand sector* Elements of the vectors are 
stored in and accessed from the memory planes using either constant stride or 
scatter -gather addressing* Each of the memory planes has complete address 
translation capabilities to provide full M scatter-gather” capability* The 
address translation information is stored in high-speed look-up tables* 
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The RALU is a pipeline which performs multiple sequential and/or parallel 
operations on vectors. It contains twenty-four floating-point processing 
elements , eight logic units, and various pipeline support hardware elements 
which are interconnected to form the processing pipelines for particular 
computational procedures. The entire pipeline is reconf igurable within one 
clock period, which provides a dynamically changing vector processing environ- 
ment. 

Of the floating-point processing elements, twenty-two elements perform 
addition, subtraction, multiplication, floating to fixed-point, or fixed to 
floating-point operations. In addition, upon performing one of the above 
operations, the element may also be used to change the sign of the result, or 
take the absolute value of it. For these elements, the startup cost asso- 
ciated with processing the first value in a vector is one clock period. Each 
of the other two elements actually consists of four chips which are pipelined 
to form a single element. These elements may be used to perform division, 
evaluate transcendental functions, or take the square root of th£ input 
values. The startup cost for these elements is four clock periods. The 
nominal operation rate of the processing elements is 20 MFLOPS, providing a 
Nodal peak speed of 480 MFLOPS. 

Associated with each of these elements is a constant parameter latch. 
These latches are used to input constants to the elements, and may be set with 
either ALU or scalar results. It takes two clock periods to reset a constant 
parameter latch, and only one latch may be reset at a time. 

The logic units may be used for fixed point addition, subtraction, multi- 
plication, or division, or to perform logical comparisons on either fixed or 


- 7 - 


floating-point values. The startup cost for the logic units is four clock 
periods for performing division, or one clock period for performing any of the 
other operations. As with the floating-point processing elements, the opera- 
tion performed by a logic unit may be changed every clock period. 

An illustration of how the hardware elements in the RALU are configured 
to form the parallel processing pipelines is provided in figure 3. This 
particular pipeline performs a point Jacobi iteration of the central- 
differenced 3-D Poisson equation, 

V 2 P = G , 


on a uniform grid. Here, fifteen of the available floating-point processing 
elements and one logic unit have been configured in a tree structure. Nine 
operand vectors are accepted in parallel at the pipeline entrance. Operations 
on these vectors are then performed in parallel, with results from the opera- 
tions immediately routed to subsequent levels of the pipline. Note that once 
the residual value, is available, it is routed simultaneously to the 
remainder of the pipeline for updating P, and to a pipeline for performing a 
local convergence check. In this example, the convergence criteria is 


*ijkl 


<e • Upon satisfaction of this condition at all grid points, the con- 
vergence flag interrupts the microsequencer, and the pipeline is reconfigured 
for execution of the next calculation procedure in the algorithm. The nominal 
operation rate for execution of this point Jacobi iteration procedure is 300 


M PLOPS . 

As alluded to above, the processing elements and logic units, along with 
various pipeline support hardware elements, are configured into specific 
pipelines by a microsequencer . In general, this unit is used to configure the 
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pipeline at the start of an array operation, and the pipeline remains fixed 
until completion of that operation. However, results from logical comparison 
operations may be used to conditionally reconfigure the pipeline at every 
clock period, without (in general) requiring a pipeline flush. This capa- 
bility permits the vectorization of many powerful algorithms which are not 
vectorizable on conventional vector architecture supercomputers. 

It should be noted that since the NSC is still under development, speci- 
fic details of the RALU architecture, and thus the degree to which the RALU 
may be reconfigured, have not been finalized. Indeed, a major purpose of this 
study was to identify specific pipelines to be included in the final design. 
Therefore, it is assumed in subsequent sections of this paper that the only 
constraint in forming a pipeline is that the number of elements utilized may 
not exceed the number of elements to be contained in the RALU. 

As mentioned previously, the flow of vectors between memory and the RALU 
is controlled by the Memory-ALU-Swi tch Network (MASNET). MASNET consists of 
fifty-six 32-bit switch elements, where each element has two input ports, two 
output ports, and a bidirectional port to a 3 2 -word local memory. Words may 
be stored in the local memory of an element and delayed for up to 1 6 clock 
periods when two vector streams are input, or up to 32 clock periods when a 
single vector stream is input. MASNET is formed by the interconnection of the 
switch elements into a 16x16 nonblocking switch. The entire switch may be 
rerouted every clock period. MASNET is used to route operands from memory to 
the RALU inputs, route results from the RALU to memory inputs and/or RALU 
pipeline inputs (for vector recursion), and to transfer memory contents be- 
tween memory planes. MASNET is also utilized for internode communications, 
which are implemented by transferring data between each Node's MASNET. 
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Another important feature of MASNET is that it permits the generation of 
multiple, parallel vector streams from a single source. In this process, as 
values which constitute a given vector are routed through a switch element, 
they are simultaneously stored in the local memory of the element. During 
subsequent clock periods, the values are retrieved from memory and routed out 
of the switch through the second output port, as a second vector. Since the 
order in which the values are retrieved need not be the same as the order in 
which they are stored, the generated vector may deviate in local ordering from 
that of the original vector. By applying this process over a series of switch 
elements, multiple vectors may be generated from a single source. 

A second means for generating multiple vectors from a single source 
involves utilization of the vector delay and regeneration unit. This unit is 
located between the MASNET outputs and RALU inputs. It may be used to perform 
the same process described above, except that the ordering of values in the 
generated vectors must be the same as that of the original vector. However, 
whereas MASNET may be used to delay vectors for 224 clock periods at most, the 
vector delay and regeneration unit may be used to delay vectors for thousands 
of clock periods. 

The operation of the Node is controlled and monitored by a Node manager. 
The Node manager is primarily used as an intelligent interface between the 
Node and the front-end. It provides initialization, checkpointing, and data 
store handling capabilities, and decodes "macromachine instructions" which are 
used to configure the RALU. In addition, the Node manager provides scalar 
operation capabilities at a rate of 2 MFLOPS. However, it should be empha- 
sized that the Node manager rarely becomes involved in numerical computations 

* 

other than for evaluating expressions for use as constants in the RALU. 
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Internode communication* - Internode addressing on the NSC is supported 


by two addressing modes , global addressing and explicit boundary point defini- 
tion (BPD) • In this analysis, only BPD addressing is considered* BPD 

involves the explicit definition of all boundary-point data (i*e*, data asso- 
ciated with points on the boundary of the computational subdomain, which are 
to be transferred to other Nodes): the source Node of the data, and all desti- 
nation addresses. 

The internode communication process begins as results enter the first row 
of switch elements in MASNET* If a result has been defined as a boundary- 
point value, it is routed to the boundary cache of the source Node while 
simultaneously being routed through the switch element. The boundary cache is 
linked to all of the switch elements in the first row of MASNET by a common 
bus. The clock for the common bus (and the "hyperspace router”) runs four 
times faster than the clock for MASNET and the RALU. Thus, every clock 
period, it is possible to route a boundary value from each of four switch 
elements to the boundary cache. Once the data is received in the boundary 
cache, it is immediately routed to the local "hyperspace router" of the Node, 
and sent to the boundary cache of the destination Node. Then when boundary- 
point data is needed in an ongoing computation of the destination Node, it is 
accessed from the boundary cache of the destination Node and inserted into 
MASNET at the last row of switch elements. As with the first row of switch 
elements, all the elements in the last row of MASNET are linked to the bound- 
ary cache by a common bus. A schematic of the hardware interconnects between 
MASNET, the boundary cache, and the hyperspace router is presented in figure 4. 
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The local hyper space routers are nonblocking permutation switch networks 
which are used to route boundary-point data to the appropriate internode 
communication links. The data are self routing in that the destination 
addresses, which are carried with the data, are used to set the hyperspace 
router switch states. For a 128 Node NSC, the hyperspace routers contain 8x8 
switch networks, and the hypercube internode communication network links each 
Node to seven neighboring Nodes, Although internode communication is most 
easily accomplished for data transfers in which the destination Node is 
directly linked to the source Node, data may be transferred between any two 
Nodes by routing the data over a series of hyperspace routers. 

The internode communication links are implemented with fiber-optic 
cables, providing data transmission in byte-serial format at a duplex rate of 
1 Gbyte/sec. The boundary cache of each Node consists of a 1 M-word write- 
through cache. For BPD addressing, the boundary cache is continuously updated 
by pre -communicating the boundary point data as it is generated in the source 
Node, Thus, current boundary data is usually maintained within each Node, 
which eliminates most of the internode communication overhead. 

Internode communication delays, which result in the temporary suspension 
of computations, may occur in a number of situations. One such situation is 
when boundary-point values are not present in the boundary cache of the Node 
at the time they are required in the ongoing computation. The computations 
must then be suspended while those values are retrieved from other Nodes. 
This type of delay is likely to occur in problems for which the amount of BPD 
data required by each Node is greater than the storage capacity of the bound- 
ary cache. A second situation for which communication delays occur is when 
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the operands for computing a particular term include more than four boundary- 
point values. In this case, since only four values may be accessed from the 
boundary cache each clock period, routing of the operands must be delayed for 
at least one clock period while all of the boundary-point values are inserted 
into MAS NET. 

Delays may also occur in routing the BPD data out of the hyperspace 
routers. This situation arises during burst transmissions where a significant 
portion of the BPD data is transferred between Nodes which are not directly 
linked by the hypercube network. Since the BPD data must then be routed over 
a series of hyper space routers, the switch networks of the routers may become 
overloaded. If the overloading is severe enough, the BPD data will not be 
present in the boundary cache of the destination Node at the time it is 
required in an ongoing computation, causing a temporary suspension in the 
computations. This type of internode communication delay is most likely to 
occur in algorithms where the computations are not localized (e.g. global 
spectral methods). For algorithms in which the computations are localized 
(e.g. finite-difference methods), the amount of data which must be transferred 
between Nodes is small enough that delays in routing the data out of the 
hyperspace routers are unlikely to cause a suspension in the computations, 
particularly since the data is pre-communicated. 
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Numerical Simulation of Laminar -Turbulent Transition 


The particular problem considered in this analysis is the numerical 
simulation of laminar-turbulent transition in plane Poiseuille flow. The 
channel flow geometry is illustrated in figure 5. The governing equations are 
the incompressible time-dependent Navier -Stokes equations with constant vis- 
cosity. Employing the usual scaling for the channel, the lengths are scaled 
by the channel half width, l , and the velocities are scaled by the centerline 
velocity for the mean flow Uq. Written in rotation form, the 
nondimensionalized equations are 


2 * 

u. = uxuj-VP + (1/Re) V u - fi 

■'.'t % N ' S 

V o u = 0 

where 


( 1 ) 


(i) = 7 x u 


P = p + 1/2 | u | ' 


f = (1/Re) 


3u 0 (y) 

3y 2 


u Q (y) denotes the mean flow velocity, and Re = U Q £/v Re Y n olds number. The 
forcing function f represents the mean pressure gradient which drives the mean 
flow. 


Boundary conditions implied in this formulation are the no-slip condition 
at the walls, and periodic boundary conditions in the spanwise direction. For 
a true spatially developing flow, appropriate inflow, and outflow boundary 
conditions are applied in the streamwise direction. However, determination of 
an appropriate outflow boundary condition (a problem which has yet to be 
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The temporal discretization of the above equations involves a low storage 
second-order Runge-Kutta treatment of the advection term and a Crank-Nicholson 
treatment of the diffusion term in the velocity step, with a backward Euler 
pressure correction applied after each Runge-Kutta stage. For simplicity, 
second-order Runge-Kutta is used in this analysis. However, higher order 
Runge-Kutta schemes may be incorporated with only minor modifications to the 
solution procedure detailed herein. With minor changes in notation from eqs. 
(2) and (3), the time discretized equations may be written as 

u* - (h/4Re ) V 2 !!* = U n + (h/2) F n + (h/4Re) ^u" (4) 
% % > >. 


. , n+1/2 . , __n+1/2 

(2/h) (u - u* ) = -VP 

„ n+1/2 

Vou =0 


u** - (h/4Re) V 2 !!** = u n+1/2 +h(F n+1/2 + (h/4Re) V 2 u n+1/2 


(2/h) (u n+1 - u**) = -VP n+1 
V o u n+1 = 0 


(5) 

(6) 
(7) 


where F = u x « - fi. and h denotes the time increment. In order to make the 

■w N. -N. 

pressure correction step amenable to solution, the divergence of the pressure 
equation is taken and the incompressibility constraint is enforced on it. 
Absorbing the time increment into the pressure value, then from eq. (5), 


V 2 P n+1 /2 = 


V o u* 
**» 


( 8 ) 


(Equation 8 is the consistent Poisson equation. It represents the composition 
of the discrete divergence operating on the discrete gradient of the 
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resolved adequately) is particularly difficult in that conditions at the 
boundary are not known a priori. Although the outflow boundary conditions 
utilized in previous low resolution simulations of spatially developing shear 
flows have a relatively small upstream influence on the solution [7, 8] , it is 
unclear as to whether these boundary conditions are adequate for a full 
laminar-turbulent transition simulation. If the parallel flow assumption is 
made, periodic boundary conditions are applied in the streamwise direction. 
The initial condition is that of a small disturbance superimposed upon fully 
developed plane Poiseuille flow. 

Algori thm formulation ♦ - The solution algorithm is based on a time- 

splitting scheme in which the solution is advanced from t = t n to t = t n+1 as 
follows: in the first step, 

2 A 

u^ = u x a) + (1/Re) V u - fi (2) 

*v t > “S. 

is integrated from t n to the intermediate time t*; in the second step, 

= -VP (3) 

•'•t 

V o u = 0 

•N. 

is integrated from t* to t n+1 • Application of the boundary conditions for 
this scheme is discussed in detail by Zang and Hussaini [9], and will not be 
covered here. However, it should be noted that whereas a staggered grid is 
used for the pressure in the formulation of [9], a non-staggered grid is used 
in this formulation. Use of the non-staggered grid not only changes the form 
of the spatially discretized equations, but also necessitates the application 
of a consistent artificial boundary condition for the pressure [10]. 
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pressure*) Upon solution of this Poisson equation for the pressure, the 
velocity is corrected from 

» nt1 ' 2 . u* - w n *' /2 <9> 

*N» *N. 

The pressure correction step after the second Runge-Kutta stage is treated in 
a similar manner* 

Formulation of the algorithm is completed by spatially discretizing the 
equations* Since the NSC is tailored towards algorithms in which the computa- 
tions are localized, central-differencing is used in the spatial discreti- 
zation of this formulation* The computational grid for this problem is 
Cartesian with uniform spacing in the streamwise and spanwise directions, and 
arbitrary stretching in the vertical direction* Grid stretching is used in 
the vertical direction so that grid points can be concentrated near the walls, 
where gradients in the primitive variables are especially large* 

Solution procedure , - The procedure for advancing the solution from 
t = t n+1 begins with the calculation of the three components of the advection 
term, 

„n n n - T 

F = u x a) - fi (10) 

v "v 

The second step of the procedure entails calculating the right hand side of 
the Helmholtz equation for the x velocity component, and then solving the 
Helmholtz equation to update u to the intermediate time level t** This proce- 
dure is then repeated for the y and z velocity components* Denoting the right 
hand sides of the three Helmholtz equations as L, then the equations are 

L n = u n + (h/2) F n + (h/4Re ) 7 2 u n (11) 

% % % ■'* 
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( 12 ) 


u* - (h/4Re ) V 2 u* = L n 

■w «v •*. 

Upon updating the velocity components, the right hand side of the Poisson 
equation is calculated, and then the Poisson equation is solved for P nH ^ 
where 


K* = 7 o u* 

(13) 

v 2 p n+1/2 = K * 

(14) 


Corrected values of the velocity components are then determined from 


u n+1/2 = u* - VP n+1/2 


(15) 


the above procedure is then repeated for the second stage of the algorithm to 

advance the solution to t n • The only difference in the second stage calcu- 

n*4"1 / 2 

lations is that F° is absorbed in the calculation of F ' , as 

•N. 

_n+1/2 n+1/2 n+1/2 .... 

F ' = u xuj ' - fi - (1/2) F (16) 

As mentioned previously, two iterative methods for solving the Helmholtz 
and Poisson equations are considered in this analysis. The first method is 
Red-Black SOR. This is an explicit two-color point method in which a point 
Jacobi type iterative scheme is utilized. A complete iteration involves 
updating red points (i+j+k odd) first, and then updating black points (i+j+k 
even) using the latest available values in calculation of the residual. 

The second method is ZEBRA 1 , which was developed as a means for perform- 
ing line relaxation in a vectorizable manner [11]. It is a two-color line 
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scheme in which line SOR is performed along alternating lines, where the grid 
in planes normal to those lines has a checkerboard pattern. In this algorithm 
the line relaxation is performed in the vertical direction, which is the 
direction of grid stretching. The tridiagonal systems of equations generated 
for the vertical lines are first solved along red lines (i+j even), and then 
solved along black lines (i+j odd) using the latest available values in the 
residual calculation. The tridiagonal systems of equations are solved using 
the Thomas algorithm. 

This section of the paper is concluded with a brief discussion on 
enforcement of the boundary conditions, since this subject has been avoided up 
to now. In many algorithms, and in this algorithm in particular, enforcement 
of the boundary conditions involves minor modifications to the discrete 
equations at the boundaries. These modifications generally consist of either 
adding or deleting a few terms from the discrete equations, without changing 
the form of the equations. On the Nsr. i ncorpora ti on of snob modif ioa tions to 
the RALU pipelines involves a reconfiguration in which a few processing ele- 
ments are either added to, or deleted from, the pipeline. Since the RALU is 
reconfigurable every clock period, in most cases the reconfiguration may be 
performed without interrupting the routing of operands through the pipeline. 
Therefore, on the NSC, the treatment of boundary conditions has little or no 
effect on the implementation of computational procedures, or on the computa- 
tion rate. For this reason, and in order to keep the remainder of this analy- 
sis as simple as possible, the treatment of boundary conditions is avoided in 
subsequent sections. 
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Implementation of the Algorithm 


Implementation of the aforementioned algorithm begins with a determi- 
nation of how the computational domain is to be distributed over the Nodes. 
For now it is assumed that an LxMxN computational grid is subdivided into 128 
subdomains of dimension IxJxK, where I = L/8, J = M/4, and K = N/4. In order 
to simplify the computational procedures, it is further assumed that I is an 
even number, and that J and K are multiples of 4. Conceptually, the computa- 
tional domain is mapped into a three-dimensional lattice of Nodes. For 
finite-difference algorithms then, each interior Node need communicate only 
with its adjacent Nodes. Boundary Nodes (i.e. Nodes into which grid points 
from the boundary of the computational domain have been mapped) must also 
communicate with non-adjacent Nodes in situations where periodic boundary 
conditions are to be enforced. The hypercube network does not provide direct 
links between all adjacent Nodes in a three-dimensional lattice, or between 
apposite non-adjacent boundary Nodes. 

With this mapping of the computational domain, the Nodes typically per- 
form identical processes. In fact, the only situation for which processes 
differ is when boundary Nodes are evaluating terms in which the boundary 
conditions are enforced. As discussed in the section on algorithm formu- 
lation, treatment of the boundary conditions requires only minor modifications 
to the RALU pipelines. Hence, the procedures for implementing the. algorithm 
in the boundary Nodes are nearly identical to the procedure implemented in the 
interior Nodes. 
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On the nodal level, a procedure for allocating memory planes to store the 


variables must be chosen. This allocation procedure is crucial to efficient 
implementation of the algorithm as it not only affects the ordering of values 
in the operand and result vectors, but also influences the actual configu- 
ration of the RALU pipelines. In the following analysis, the same allocation 
procedure is used for the Red-Black SOR and ZEBRA 1 solutions of the Helmholtz 
and Poisson equations. Hence, the procedures for computing the explicit terms 
in the algoithm are identical for the two iterative methods. 

Schematics illustrating the memory plane allocation procedure for storing 
the primitive variables u, v, w, and p are presented in figures 6, 7, 8, and 
9, respectively* Here, two-dimensional subsets of the grid in x-y planes are 
illustrated. The two numbers above each grid point denote the (i, j) indices 
of the point; the number below each grid point denotes the memory plane in 
which the particular variable is stored. Grid points on the dashed lines 
indicate points which have been mapped into adjacent Nodes. The sequences for 
allocating the memory planes in the vertical direction are repeated for every 
fourth x-y plane. 

To illustrate the ordering in which a variable is stored within a given 
memory plane, storage of the primitive variable u within memory plane 0 (MPO) 
is considered. From figure 6, with k set equal to 1, grid point (1, 1, 1) is 
the first point for which u is stored in MPO. Consecutive grid points are 
determined by moving across the grid in x, then up the grid in y. Hence, 
consecutive memory locations in MPO contain the values of u for grid points 
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( 1 / 1 / 1 )/ (2 , 1 , 1 ) r ( 3 , 1 , 1 ) 


(I# If 1) 


(1, 5, 1 ), (2, 5, 1), (3, 5, 1), (If 5, 1), 

(1, J-3, 1), (2, J-3, 1), (3, J-3, 1), (If J-3, 1) 

Moving up the grid to plane k=3, the next set of consecutive memory locations 
in MPO contain the values of u for grid points 


(1 , 

3, 3), (2, 3, 3), 

(3, 

3, 

3), 

(I, 

3, 3) 

(1, 

7, 3), (2, 7, 3), 

(3, 

7, 

3), 

(I, 

7, 3) 

(1 , 

J-1, 3), (2, J-1, 

3), 

(3, 

J-1, 3), 

* * * * 

(I, 


This sequence for storing the values of u is repeated for planes k=5 and 
7 , k=9 and 1 1 , etc*, up through planes k=K-3 an k=K-1 • It follows that the 
values of u for one-eighth of the grid points are stored in MPO* The remain- 
ing values of u are distributed over MPs 1-7, using similar procedures to 
assign the array elements to consecutive memory locations. 

Comparison of figures 6, 7, 8, and 9 indicates that at every grid point 
for which u is stored in MPO, the values of v, w, and P are stored in MPs 8, 
2, and 10, respectively* Consequently, the sequence for storing the values of 
v,w, and P at these grid points is identical to the sequence for storing u* 
Similar relationships between the variables, and the memory planes in which 
the variables are stored, hold at all other grid points. 

The manner in which the variables are stored suggests a natural ordering 
for values in the operand and result vectors. In general, the sequential 
order of values in a given vector is the same as the order in which those 


- 22 - 



values are stored* This natural ordering gives vector lengths of IxJxK/8. 
There are two occasions when orderings other than the natural ordering occur* 
One occasion is when elements of an operand vector are accessed as BPD data 
from neighboring Nodes, in which case those values are inserted into the 
vector as the operands stream through MAS NET* The other occurrence is in the 
iterative solution of the Helmholtz and Poisson equations, as will be dis- 
cussed later in this section* 

Calculation of the Advection Term * - The first term to be calculated is 
the advection term for the first stage of the Runge-Kutta/Crank-Nicholson 
temporal discretization* Using second-order central differencing in the 
spatial discretization of Eg. ( 10 ), and denoting the (x, y, z) components of 
the advection term and the vorticity vector as F = (F,G,H) and w - (£ n, ?) , 
the components of the advection term may be written as 


„n n n n n * 

F, = v. 5. .. - w, T), , t - f 

13k ljk^ijk 13k 13k 


( 17 ) 


n n -n n n 

ijk ~ W ijk^ijk ijk^ijk 


( 18 ) 


„n n n n -n 

H ijk ” U ijk n ijk V ijk ijk 


where 


£ n * (w n , - w 11 , )/2Ay - ( dh. v? . , „+ dm.v? + dlv n ) 

^ijk ij+lk ij-1k ;/ * k i3k+1 k 13k Tc 13k-! 


* d v”jk + d Vijk-i - (w S! + ijk - v>jk ,/2to 


n ,, n 

n. = dh, u. .. . 

ijk k i]k+1 


«?ik ' '^.jk - ''?-, j k>' 2 “ - ( "" jt ,k - “ij-tk 1 '' 2 ^ 


( 19 ) 

( 20 ) 
(21 ) 
( 22 ) 


and dh k , dm^, and d)^ denote the coefficients for the first derivative in the 
strectched coordinate direction at plane k. 
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Calculation of the advection term components is performed using a two- 
step procedure. In the first step, r i, £, and F are computed at one-eighth of 
the grid points. The computation begins for the grid point (1, 1, 1). 

Consecutive values in the result vectors will be for grid points (1, 1, 1), 

(2, 1, 1), (3, 1, 1), etc. (i.e., the same series of grid points for which u 
is stored in MPO). From Bgs. (17), (21), and (22), calculation of r\, £, and F 
at grid point (i, j, k) requires the values of u^^, u i: - +1k , u i= ._ 1k , u i j k+1 , 

u ijk-1 ' v ijk' v i+1jk' v i-1jk' w ijk' w i+1jk' an< * w i-1jk* Looking at an 
interior grid point associated with the result vectors, say point (2, 5, k) 

(see fiqures 6, 7, and 8), the above operands reside in MPs 0, 1, 3, 4, 6, 8, 
8, 8, 2, 2, and 2, respectively. Likewise, at grid point (3, 5, k), the 

operands reside in the next consecutive memory location of their respective 
memory planes. This indicates that except for those operand values which are 
accessed as BPD data from adjacent Nodes, the values for a particular operand 
vector are accessed from consecutive memory locations of the same memory 
plane. 

It should also be noted that the vectors for v^j^, v i+ljk' anc * v i-1jk are 
all generated from MP8. The process for generating these vectors, called 
"vector latching," utilizes the vector delay and regeneration capabilities of 
MASNET. In the vector latching process, the values streaming out of MP8 
constitute the values of the vector for v ^ + -|j] c * As these values are routed 
through one of the switch elements in MASNET, they are also stored in the 
local memory of the element. One clock period later, the values are routed 
out of the element as a second vector. This second vector contains the values 
of which have been offset from the v^ +1 j k values by one value. At the 
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next switch element, this process is repeated to generate the vector for 
v i-1jk* Vector latching is also used to generate the vectors for 

w ijk' and w i+1jk from MP2# 

An illustration of the RALU pipeline configuration for calculating rj, £, 
and F is presented in figure 10. Here, the operand memory planes, and the 
variable stored in each plane, are indicated at the top of the figure. The 
vector latches for v and w are indicated by the paths the operands take in 
MASNET. In this 17 operation RALCJ pipeline, the values of n and £ are calcu- 
lated in independent pipelines. Upon calculation of these terms, they are 
routed both to memory, and to the remainder of the pipeline for calculating 
F. The results for r), l and F are stored in to MPs 9, 11, and 12 
respectively. 

The second step of the advection term calculation procedure is to calcu- 
late G, and H at the same grid points for which r\, £, and F are calculated 
in the previous step. From Egs. (18), (19), and (20), calculation of these 

terms at grid point (i, j, k) requires the values of v ijk* v i jk+1 ' 

v ijk-1' w ijk' w ij+1k' w ij-1k' and ^ijk • From figures 6, 7, and 8, the 

values of the first seven operands reside in MPs 0, 8, 12, 14, 2, 3, and 1, 
respectively. From the previous step, the values of n and £ reside in MPs 9 
and 11, respectively. The 14 operation pipeline for performing the second 
step computations is illustrated in fiqure 11. As indicated at the bottom of 
the figure, the results for G and H are stored in MPs 4 and 13, respectively. 

At this point in the computation, the three components of the advection 
term will have been computed at one-eighth of the grid points. In order to 
calculate the advection term at the rest of the grid points, this two-step 
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procedure is then repeated seven times. The sequence in which the first and 
second step calculation procedures are performed is outlined in tables 1 and 
2. Listings of the memory planes in which the operand and result values for 
each step are stored are also presented in these tables. 

In order to project the amount of time it would take for calculation of 
the advection term, the startup time, actual computation time, and internode 
communication delay time must be determined. In the various steps of the 
calculation procedure, once the pipeline is full, and assuming there are no 
delays, results are generated every clock period. Consequently, for a given 
step of the procedure the actual computation time, in clock periods, is 
IxJxK/8 (i.e., the length of the vectors). Hence, the actual computation time 
for the advection term computation is 2 IxJxK clock periods. 

For the first step in the procedure, startup time is accrued as the first 
values of the operand vectors are accessed from memory and routed through 
MASNET and the RALU, and as the first values for the result vectors are routed 
back through MASNET and stored in memory. The initial startup time for this 
process is 

1 to access operands from memory 

8 to route operands through MASNET, including the vector latch 

7 to route operands through the RALU 

7 to route results through MASNET 
1 to store results in memory 
24 clock periods 

Additional delay time is accrued whenever the constant parameter latches which 
contain the coefficients for the first derivative in the stretched coordinate 
direction must be reset. For instance, after the operands for grid point 
(I, J-3, 1) have been processed through the first row of floating-point 
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processing elements in the RALU, the next operands to be processed in these 
elements are for grid point (1, 3, 3). Since the value of the k index has 
changed, the constant parameter latches containing the values of dh^, dm 
and d£^ must be reset before the computation can be continued. It takes 6 
clock periods to reset these latches each time the computations proceed to 
another vertical plane. Lumping this delay time with the initial startup 
time, the total startup time for the first step is 24+6 (K/2-1 ) clock periods. 

The initial startup time for the second step of the procedure is 22 clock 
periods. The reason for this delay is that MPs 9 and 11, which are used to 
store results in step 1, are accessed for operands in step 2. Since a memory 
plane cannot be accessed for operands while it is being used to store results, 
operands from MP 9 and 11 cannot be accessed until the last results from step 
1 have been stored. Consequently, the computations must be suspended until 
the first operands for the second step have been routed through MASNET. 
However, in step 3, none of the memory planes which are accessed for operands 
have been used to store results in step 2. Therefore, the only initial start- 
up time for step 3 is the time to reset the constant parameter latches. 
Determining the startup time for the other 13 steps in a similar manner, the 
total startup time for calculation of the advection term is found to be 
162+48K clock periods. 

In computing the advection term, BPD data for the variables u, v, and w 
must be present in the boundary caches of the Nodes. Considering the variable 
u, boundary -point values are required at all (i, 1, k), (i, J, k), (i, j, 1), 
and (i, j, K) grid points, for a total of 2(IxJ+IxK) values. Similarly, the 
number of v and w boundary-point values required are 2(JxK+JxI) and 
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2( KxI+KxJ ) , respectively. Thus, a total of 4( IxJ+JxK+KxI ) boundary -point 
values are required for computing the advection term at all of the grid 
points. Subdomains of dimension 420x420x240 are utilized in one of the 
examples presented at the end of this section. For subdomains of this size, 
computation of the advection term requires around 1.5x10 6 boundary -point 
values. However, the boundary cache has a 1 Mword storage capacity, so it is 
not possible to store all of the requisite BPD data within the boundary cache, 
prior to beginninq the computations. 

In order to avoid situations where BPD data is not in the boundary cache 
of the Node at the time it is required, the advection term is computed at one- 
half of the grid points at a time. Thus, before beginning the computational 
procedure, the boundary cache of each Node is reloaded with the BPD data for 
performing the computations at the first half of the grid points (i.e. steps 
1-8 in tables 1 and 2). For an interior Node, this requires the communication 
of 2( IxJ+JxK+KxI) boundary-point values to its adjacent Nodes. Utilizing the 
scatter-gather capabilities of the Nodal memory, and accessing the BPD data 
from four memory planes at a time, the delay for this process is 
1 /2( IxJ+JxK+KxI ) clock periods. Upon completion of the computations at the 
first half of the grid points, computations are suspended while the boundary 
caches are reloaded with the BPD data for computations at the second half of 
the grid points. The total internode communication delay for reloading the 
boundary caches is IxJ+JxK+KxI clock periods. There are no other internode 
communication delays in the advection term calculation procedure. 

Combining the startup time, actual computation time, and internode com- 
munication delay time, the total time for computation of the advection term is 
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164+{48+2IxJ)K+IxJ+JxK+KxI clock periods. The operation count for the pro- 
cedure is 31 IxJxK. Neglecting the startup cost and internode communication 
delay cost, the nominal operation rate of the procedure is 310 MFLOPS. 

Calculation of the Remaining Explicit Terms . - The next term to be calcu- 
lated is the right hand side of the Helmholtz equation for the x velocity 
component. Using second-order central differencing in the spatial discreti- 
zation of Eg. (11), and denoting the components of L as L = {L, M, N), then 
the x velocity component of the equation may be written as 


L i jk - u i jk + ,h / 2,F i jk + < h ' ,4Re| (<u i + , jk + 

* ‘“ijtik + u ij-ik ) '' ly2 + “V"ijk+i + Vijk + D Vijk-,> 


(23) 


where 


B k = Dm k ” 2/Ax2 “ 2 / A y 2 

and Dh^, Dm^, and denote the coefficients for the second derivative in the 

stretched coordinate direction. 

The first step of the calculation procedure is to compute the values of L 
for the vector beginning at qrid point (1, 1, 1). From figure 6, the values 
of u for this computation reside in MPs 0, 1, 3, 4, and 6. As indicated in 
table 1, the values for F are accessed from MP 12. The 15 operation RALU 
pipeline for calculating the right hand side of the Helmholtz equation is 
illustrated in figure 12. As indicated at the bottom of the figure, the 
results for L are stored im MP 8. 

The sequence for the remaining 7 steps of the calculation procedure, and 
the memory planes utilized in each step, are listed in table 3. The startup 
time and actual computation time for the procedure are 18+24K clock periods 
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and IxJxK clock periods , respectively. The internode communication delay 
time, which is accrued while the boundary cache is reloaded with the 
2( IxJ+JxK+KxI ) boundary-point values of u required in the computations , is 
1 /2(IxJ-KJxK+KxI) clock periods. Note that for subdomains of dimension 
420x420x240, the storage capacity of the boundary cache is large enough to 
store all of the BPD data for u. The operation count for this procedure is 15 
IxJxK, and the nominal operation rate is 300 MFLOPS. 

The next step in the solution algorithm is to solve the Helmholtz 
equation for the x velocity component. The Red -Black SOR and ZEBRA 1 
iterative procedures for solving the Helmholtz equation are discussed later in 
this section. Upon solution of this equation, the next steps in the algorithm 
are to compute the right hand side of the Helmholtz equation, and then solve 
the equation, for the y velocity component, and then for the z velocity compo- 
nent, The procedures for computing the right hand side of the Helmholtz 
equation for the y and z velocity components are identical to the procedure 
for the x velocity component, although different memory planes as utilized in 
the various steps of the procedures. The total time to compute the right hand 
sides of the three Helmholtz equations is 54+ ( 72+3IxJ )K+3/2 ( IxJ+JxK+KxI ) clock 
periods, and the operation count is 45 IxJxK. 

The next step in the algorithm is to compute the right hand side of the 
Poisson equation for pressure. Spatially discretizing Eq. (13), the term may 
be written as 

K ijk ■ (u i+ijk - “i-ijk 1 ' 24 * + <v ijtik - v ij-ik ,/24y (24) 
+ dh k”ijk« + d \ u !jk * dt k w *jk-i 
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In computing the values of K, 7 operands are required and one result is gener- 
ated, using 8 of the MASNET output ports. Since 16 output ports are avail- 
able, it is possible to compute this term for two distinct vectors at the same 
time. The 22 operation RALU pipeline for performing this computation is 
illustrated in figure 13. Here, the left side of the pipeline computes the 
values of K for the vector beginning at grid point ( 1 , 1 , 1 ), and the right 
side computes the values of K for the vector beginning at grid point 
(1, 2, 1). The sequence for the steps of the calculation procedure, and the 
memory planes utilized in each step of the procedure, are listed in table 4. 
The total time to compute the right hand side of the Poisson equation is 
34+24K+ ( 1 /2 ) ( IxJ xK+IxJ 4J xK+KxI ) clock periods. The operation count is 11 
IxJxK, and the nominal operation rate is 440 MFLOPS. 

The next step in the algorithm is to solve the Poisson equation for the 
pressure. As for the Helmholtz equation, the iterative procedures for solving 
the Poisson equation are discussed later in this section. 

The last procedure for the first stage of the Runge-Kutta/Crank-Nicholson 
temporal discretization is to correct the velocity values using the updated 
values of the pressure. From the spatial discretization of Eq. (15), the 
three velocity correction equations may be written as 


n+ 1/2 

u. ' = u* 

13 k 13 k 

n+ 1/2 . 

v. ' = v* 

13 k 13 k 

n+ 1/2 

w. . ' = w* 

13 k 13 k 


- p ^ )/2Ax 
- p ^/ 2Ay 

(d V«ifr dm k p ^k /2+ d V±Ji-? ) 


(25) 

(26) 
(27) 
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The 12 opeation RALU pipeline for performing the velocity correction computa- 
tions is illustrated in figure 14. In this figure, the memory planes indi- 
cated are for the computation of u, v, and w for the vectors beginning at grid 
point (1, 1, 1). The sequence for the eight steps of the calculation proce- 
dure, and the memory planes utilized in each step, are listed in table 5. The 
startup time and actual computation time for the procedure are 1 20+48K and 
IxJxK clock periods, respectively. Note that the step prior to the velocity 
correction procedure is the solution of the Poisson equation for pressure. 
Therefore, the boundary cache already contains the updated pressure values 
which are accessed as BPD data in the velocity correction procedure. As a 
result, there is no internode communication delay for this procedure. 

Upon completion of the velocity correction procedure, the values for u, 
v, and w no longer reside in the memory planes in which they were stored 
initially. In order to simplify the implementation of subsequent procedures 
in the algorithm, these values are transferred back to their initial memory 
locations. As indicated in table 5, the values of u, which were initially 
stored in MPs 0-7, reside in MPs 8-15. The memory plane -to -memory plane 
transfers of these values, which are implemented by routing the vectors 
through MASNET, may be performed for all eight vectors simultaneously. It 
takes IxJxK/8 clock periods to perform this procedure. The values for v and w 
are treated in a similar manner. Combining this transfer time with the start- 
up time and actual computation time, it takes 1 20+48K+1 • 375 IxJxK clock 
periods to implement the velocity correction procedure. The operation count 
for the procedure is 12 IxJxK, and the nominal operation is 174 MPLOPS. 


- 32 - 



At this point in the algorithm, the primitive variables have been 
advanced to the t=t n+1 / 2 time level. The next steps in the algorithm are to 
perform the computations for the second stage of the Runge-Kutta/Crank- 
Nicholson temporal discretization, to advance the solution to t=t n+1 . Compar- 
ison of the equations for the two stages reveals that the only differences 
occur in calculation of the advection term, as indicated by Eqs. (10) and 
(16). However, the additional term which appears in each of the three 
advection term component equations is easily treated with only minor modifica- 
tions to the RALU pipelines. Hence, the calculation procedures for the second 
stage are nearly identical to those for the first stage, and will not be 
discussed here. 

For advancement of the solution from t=t n to t=t n+1 , the time involved in 
computing the explicit terms is 522+348K+1 3.75 IxJxK+6 ( IxJ-tJxK+KxI ) clock 
periods. The operation count for these computations is 204 IxJxK. Neglecting 
the startup time and internode communication delay time, the sustained opera- 
tion rate is around 297 MFLOPS. 


Red-Black SOR Solution of the Helmholtz and Poisson Equations : Red-Black 
SOR is a two-color point method in which a point Jacobi type iterative scheme 
is utilized. A complete iteration entails updating red points (i+j+k odd) 
first, and then updating black points (i+j+k even) using the latest available 
values in calculating the residual. Considering the first Runge-Kutta stage 
of the algorithm, the Helmholtz equation for the x velocity component is 
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u* - (h/4Re ) vV = L n 


(28) 


After second-order central differencing Eg. (28) , the residual eguation may be 
written as 


(«e/h)Rj jk = (4R«/h)Lj. k * (^Ulc^i-jk’ 4 * 2 + (u ij + 1k +u iJ->k )/4y: 


(29) 


m 


Dh k u ijk + 1 + D \ u ijk-1 - Vi 


jk- 


L jk 


where 

B k = + 2 / Ax ^ + 2/Ay 2 - Dm^ , 

v = m for red points , v = m+1 for black points , and m denotes the iteration 
level. Applying successive over-relaxation (SOR), the update equation becomes 


m+1 m , . - » _ m 

u ijk ■ u ijk + “ |4Re/h)R ijk /B k 


(30) 


where tu denotes the relaxation parameter. 

The first step of the computational procedure is to update the values of 
u for the vector beginning at grid point (1, 1, 1), at red points only (i.e. 
grid points (1 , 1 , 1), (3, 1/ 1), (5, 1, 1 ) , etc.). Hence, consecutive values 
in the operand vectors are accessed from every other sequential memory loca- 
tion of the appropriate memory planes, rather than from consecutive memory 
locations. This gives vector lengths of IxJxK/16 for each step of the compu- 
tational procedure. The 17 operation RALU pipeline for calculating the resid- 
uals, performing an in-line local convergence check, and updating the values 
of u is illustrated in figure 15. The memory planes indicated in the figure 
are for the first step of the procedure, where the values of u ^j k 

accessed from MPO and the values of u™T^ are stored in MP 15. 
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Subsequent steps in the computational procedure involve updating the 
values of u for the remaining red points, and then updating the values of u 
for the black points. The grid point at which the computations are begun for 
each step, and the memory planes utilized, are listed in table 6. 

In order to simplify the implementation of subsequent iterations, the 
updated values of u are always transferred back to their initial memory loca- 
tions. Hence, each step of the procedure also involves a memory plane-to- 
memeory plane transfer of the updated values via MASNET, as indicated at the 
right of figure 15. The memory planes utilized in the intranode data trans- 
fers, and the sequence in which the transfers occur, are listed in the last 
two cloumns of table 6. As indicated, the values of u which are updated in 
the first step, are transferred back to their initial memory locations in the 
third step. 

The iterative solution of this Helmholtz equation requires 2( IxJ+JxK+KxI ) 
boundary -point values of u per interior Node. For the computational sub- 
domains considered in this analysis, all of the BPD data may be stored within 
the boundary cache. As an iteration proceeds and values of u are updated, the 
updated boundary -point values are immediately routed to their destination 
Nodes, replacing the old values. For a global mapping of the computational 
grid as specified at the beginning of this section, the BPD data generated 
from updating u at red points is not used in the destination Nodes until black 
points are updated, and vice versa. For the sequence of computations speci- 
fied in table 6, the time between updating a bound ary -point value and 
utilizing that value in the destination Node is at least IxJxK/4 clock 
periods. Thus, the BPD data is pre-commumicated long before it is required in 
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an ongoing computation of a destination Node, but is never communicated before 
the old values have been utilized. This not only ensures proper operation of 
th£ iterative method, but also ensures that there are no internode communi- 
cation delays for the computational procedure, 

*flie total computation time for performing a Red-Black SOR iteration, 
including the intranode transfer of all the updated values back to their 
initial memory locations, is 36+48K+ (1+1/16) IxJ xK clock periods. Assuming it 
takes h r iterations to attain global convergence of the solution, the total 
computation time for solution of the Helmholtz equation is h r ( 36+48K+IxJxK) 
+1/16(IxJxK) clock periods. The operation count for the procedure is 17 h r 
IxJxK. Ihe nominal operation rate is 340 MFLOPS. 

The procedures for solving the Helmholtz equations for the y and z 
velocity components are identical to that for the x velocity component, 
although different memory planes are utilized in the various steps. Solution 
of the Poisson equation is also quite similar, in that only minor modifica- 
tions to the RALU pipeline must be made. After central-dif ferencing Eq. (14), 
the residual and update equations for the Poisson equation are 


R® = (P V .. + )/Ax 2 + (p. v . + p v . )/Ay 2 

13 k i+ 1 ;)k i- 1 ;jk i]+ 1 k 13 -Ik 


D Vi 


m 


jk +1 k i;jk -1 k ljk 13 k 


_nH-1 m , . 

ijk ~ P ijk a)R ijk^ B k 


where 


B k = 2/Ax 2 + 2/Ay 2 - Dm k 


(31 ) 


( 32 ) 
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Comparison of Bqs. (31) and (32) with Eqs. (29) and (30) indicates that there 
is one less operation in the RALU pipline for the Poisson equation. However , 
the computation time for performing an iteration is the same as that for the 
Helmholtz equation. Assuming that p r iterations are required to attain global 
convergence of the solution, the total computation time for solving the 
Poisson equation is p r ( 36+48K+IxJxK) +( 1/16)IxJxK clock periods, the operation 
count is 16 p r xIxJxK, and the nominal operation rate is 320 MFLOPS. 


ZEBRA 1 Solution of the Helmholtz and Poisson Equations : ZEBRA 1 is a 

two-color line method in which line SOR is performed along alternating 
vertical lines. The tridiagonal systems of equations generated for the 
vertical lines are first solved for red lines (i+j even), and then for black 
lines (i+j odd) using the latest available values in computing the 
residuals. Once again considering the Helmholtz equation for the x velocity 
component at the first Runge-Kutta stage, the residual equation for ZEBRA 1 
may be written as 


URe/hjR^ = (4Re/h)L^ jk 4 4 u”_, Jk )/tx 2 * <»i j+1k 4 u”. _, k > 4y 2 


m _ m __ . m 

+ Dh k U ijk+1 ~ k U ijk + \ U ijk-1 


( 33 ) 


where 


= (4Re/h ) + 2/ lax 2 + 2/ Ay 2 - Dm^ , 
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v = m for red lines, and v = m +1 for black lines. After applying line SOR, 


the update equation, written in delta form, becomes 


+ ( V“ )Au i;jk - D V U ? jk+ 1 


( 4 Re/h)R*. k 


( 34 ) 


where 


. m m+1 m 

Au. M = u. - u. 
13k 13k 13k 


and a) denotes the relaxation parameter. Utilizing the Thomas algorithm to 
solve the tridiagonal systems of equations, a two-step procedure is developed 
for updating the velocity. In the first step 


E i jk ‘ ««“/»»<? j k + D V , i j k-1 ,/( V" ' D ViJk-1> ,35) 

p ijk- D V'V“- “Vijk-i 1 1361 

are calculated from k = 1 up through k = K. In the next step 


. m _ _ . m 

Au. = E. . + F . Au 

13k 13k 13k I3k+1 


m+1 m , . m 

u. = u. + Au. 

13k 13k 13k 


( 37 ) 

( 38 ) 


are calculated from k = K down through k = 1 . 

The computational procedure begins with the calculation of E and F for 
the vectors beginning at grid point (1, 1, 1), at those grid points which lie 
along red lines. Consecutive values of the operand vectors are thus accessed 
from every other sequential memory location of the appropriate memory planes. 
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Subsequent steps in the procedure involve the calculation of E and F at grid 
points which lie along this same subset of red lines, moving up the grid in 
the vertical direction. Due to the recursive nature of the Thomas algorithm, 
the computations proceed up the grid in consecutive x-y planes. Hence, for 
the memory plane allocation procedure utilized herein, the computations may 
only be vectorized accross x-y planes, and the vector lengths are IxJ/8. Upon 
completion of the calculation of E and F at all grid points which lie along 
this subset of red lines, Au m and u m+1 are calculated along the same lines, 
proceeding down the grid in the vertical direction. Again, the vector lengths 
for this procedure are IxJ/8. Upon completion of these computations, u will 
have been updated along one-fourth of the red lines. 

The 22 operation RALU pipeline for calculating the residual, performing 
an in-line local convergence check, and computing the values of E and F is 
illustrated in figure 16. The memory planes indicated are for the vectors 
beginning at grid point (1, 1, k). The sequence of steps in the procedure for 
computing E and F for this subset of red lines, and the memory planes utilized 
in each step, are listed in table 7. As indicated at the right of the figure 


and in the last two columns of table 7, the computational procedure also 
involves an intranode data transfer of the values of to a memory plane 

for temporary storage. This memory plane-to-memory plane transfer is per- 


formed so that in the second step of the procedure, where values of u_^ are 
calculated, the updated values may be stored in their respective initial 


memory locations. 

The next step in the procedure entails computing Au m and u m+1 at grid 
points along the subset of red lines for which E and F are calculated in the 
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previous step* Hie 9 operation RALU pipeline for performing this calculation 


is illustrated in figure 17 * In this pipeline, the computations are performed 
in 3 consecutive x-y planes, simultaneously* Beginning at the left of the 
figure, values of are routed to the RALU for use in the computation of 

Au*jV^ + 2 • The computed values of Au^.^ + 2 then branched off to two indepen- 
dent pipelines. In the left pipeline, the values of are com P ute ^ and 

then routed to memory; in the right pipeline the computations for Au™ are 

XjivT I 

begun* This process is repeated twice in order to compute u*!*T^ and u™ + ^ • 

i^k+1 13k 

As indicated at the bottom right of the figure, the values of Au?., are 

13k 

routed out through MASNET and temporarily stored in the vector delay and 
regeneration unit* In the subsequent step, these values are then routed back 
to the RALU for use in calculating Au™^ ^ • The sequence of procedures for 
updating u for this subset of red lines, and the memory planes utilized in 
each step, are listed in table 8. Note that the sequence of calculations are 
repeated every 12 x-y planes. 

At this point in the computations, the values of u will have been updated 
at the grid points for one-fourth of the red lines. Subsequent steps in the 
procedure entail updating the values of u for the remaining red lines, and 


then for black lines* 


For the previously specified mapping of the computational grid into the 
Nodes, where K = N/ 4 , the grid points which lie along a given vertical line 
are mapped into four different Nodes. For the ZEBRA 1 computational procedure 
described above where the computations proceed up, and then down the grid in 
consecutive x-y planes, three of the four Nodes will be idle at any given 
time. In order to avoid this situation, a different mapping of the computa- 
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tional grid is utilized in which the subdomains of dimension IxJxK have dimen- 
sional lengths of I = L/16, J = M/8, and K = N. To further simplify the 
implementation of the procedures, K is assumed to be a multiple of 12. 
Conceptually, the computational domain is now mapped into a two-dimensional 
lattice of Nodes, where all grid points along a given vertical line are mapped 
into the same Node. The hypercube network then provides direct links between 
all adjacent Nodes in the lattice, and for interior Nodes, the number of 
boundary-point values generated for each primitive variable is 2(IxK+JxK), 
rather than 2(IxJ-KJxK+KxI) • This new mapping of the computational domain has 
little effect on the computational procedures for the explicit terms, particu- 
larly since the Nodal memory plane allocation procedure remains unchanged. 

As with the Red-Black SOR iterative method, the internode communication 
requirements for the ZEBRA 1 method are such that there are no inter node 
communication delays for this computational procedure. Assuming it takes h z 
iterations to attain global convergence of the solution, the total computation 
time for solution of the Helmholtz equation is h z (1600 + 4IxJ)K/3 clock 
periods. The operation count is 25 h z IxJxK, and the nominal operation rate is 
375 MFLOPS • 

The procedures for solvinq the Helmholtz equations for the y and z veloc- 
ity components are identical to that for the x velocity component. Solution 
of the Poisson equation is also quite similar, although there is one less 
operation in the calculation of the residual equation. Assuming that p z 

iterations are required to attain global convergence of the solution, the 
total computation time for solving the Poisson equation is p z (1600 + 4IxJ)K/3, 
the operation count is 24 p z IxJxK, and the nominal operation rate is 360 
MFLOPS* 
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Projected Timing Results; The actual computation time for advancing the 
solution from time t n to t n+1 is dependent upon the size of the grid sub- 
domains, the relative dimensions of the subdomains, and the iterative method 
used to solve the Helmholtz and Poisson equations* The size of the grid 
subdomains which may be mapped into the Nodes depends upon the effective 
number of variables which must be stored per grid point, including the tempo- 
rary storage of intermediate results. For the computational procedures 
detailed herein, both the Red-Black SOR and ZEBRA 1 algorithms require the 
effective storage of 12 variables per grid point* 

The relative dimensions of the LxMxN computational grid are dictated by 
the directional resolution requirements for the particular problem to be 
solved. For a true spatially developing flow, the projected resolution 
requirements suggest dimensions of length L = 4xN and M * 2xN • Considering a 
computational grid of dimension 3360 x 1680 x 960, which contains roughly 

Q 

5.4 x 10 grid points, then the grid subdomains for the Red-Black SOR and 
ZEBRA 1 algorithms are of dimension 420 x 420 x 240 and 210 x 210 x 960, 
respectively* Storinq 12 variables per grid point, subdomains of this size 
require about 508 Mwords of storage, utilizing about 99.2% of the available 
Nodal storage capacity. 

The projected timing results for simulation of the true spatially devel- 
oping flow are presented in table 9. For the explicit terms, comparison of 
the actual operation rate with the nominal operation rate indicates that the 
internode communication delay and startup costs constitute less than 0.5% of 
the computation time. For solution of the Helmholtz and Poisson equations, 
the startup costs for the Red-Black SOR and ZEBRA 1 methods constitute less 
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than 0.1% and 1% of the computation time, respectively. Timing results for 
the Helmholtz and Poisson equation solutions are given as a function of the 
number of iterations required to attain global convergence of the solution to 
some specified accuracy. The accuracy to which the solutions must be 

resolved, and the number of iterations required to attain this accuracy, must 
still be resolved through numerical experimentation. However, for a computa- 
tional grid of this size the spectral radius for both Red-Black SOR and 
ZEBRA 1 approaches 1, so both iterative methods require on the order of 

thousands of iterations to attain global convergence of the solutions. Hence, 
the sustained operation rate for the algorithm approaches the sustained 
operation rate of the Helmholtz and Poisson equation solvers, which is roughly 
76% and 83% of the Nodal peak speed for the Red-Black SOR and ZEBRA 1 
algorithms, respectively. The sustained operation rate for a full 128 Node 
NSC is projected to be about 43 GFLOPS and 47 GFLOPS, respectively. 

Despite the large operation rates at which the computations are per- 
formed, this simulation requires a substantial amount of computation time. 
Assuming that the ZEBRA 1 algorithm only requires 1000 iterations to attain 

global convergence of the solution for each Helmholtz and Poisson equation, 
which is quite an optimistic assumption, the time required for advancing the 
solution one time step is nearly 6 1/2 hours. Since a full simulation 

requires thousands of time steps, the real time for performing these 

computations in a dedicated environment is on the order of months. Although 
the resolution demands for many laminar -turbulent transition problems require 
far fewer grid points than the 5.4 billion utilized in this example, this 
result indicates that in order to perform this simulation in a reasonable 
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amount of time, iterative techniques with vastly greater convergence rates 
than those for Red-Black SOR and ZEBRA 1 must be employed. 

One means for enhancing the convergence rates for the Helmholtz and 
Poisson equation solvers is to utilize multigrid methods [12, 13] in 
conjunction with suitable single grid iterative schemes. Incorporation of the 
multigrid methods into the iterative schemes involves the implementation of 
additional computational procedures, which involve the projection of the 
values onto coarser grids and then interpolation of the values back to finer 
grids. The projection and interpolation procedures appear to be well suited 
for implementation on the NSC. 

In order to approximate the performance of a multigrid method for 
performing this simulation, the following assumptions are made for incorpo- 
rating a multiqrid method into the ZEBRA 1 algorithm. The effective spectral 
radius for a full multigrid V-cycle is projected to be around 0.5. Defining 
convergence of the solution to be achieved when the residual has been reduced 
by 10“ 5 , 18 V-cycles are required to attain convergence. It is further 
assumed that the computational work for performing one multigrid V-cycle is 
equivalent to that for performing two ZEBRA 1 iterations on the finest grid, 
and that there are no additional internode communication delays. Then for the 
simulation on a 3360x1680x960 grid, the total time to advance the solution one 
time step is projected to be around 14 minutes. 

A second means for enhancing the convergence rates of the Helmholtz and 
Poisson equation solvers is to utilize conjugate gradient methods [14, 15] in 
conjunction with the single grid iterative methods. Incorporation of the 
multigrid and conjugate gradient methods into the algorithm, and implemen- 
tation of those algorithms on the NSC, are the subject of future work. 

- 44 - 



Concluding Remarks 


The NSC is a multi-purpose parallel-processing supercomputer which is 
being developed to provide an efficient means for simulating large , numeri- 
cally intensive, scientific problems* Rapid solution of these problems is 
attained by structuring the computational procedures of the solution 
algorithms to effectively utilize both the fine grain and coarse grain paral- 
lelism inherent in the NSC architecture* The objective of this paper has been 
to present a detailed description of the procedures involved in implementing 
one such algorithm on the NSC* 

Crucial to the effective utilization of the fine grain parallelism is the 
choice of a memory plane allocation procedure for storing the array elements 
of the various variables. The allocation procedure utilized in this analysis 
is fairly complicated in that a lot of "bookkeeping" is required in order to 
keep track of where, and in what order, the primitive variables and intermedi- 
ate results are stored* Less complicated allocation procedures could have 
been used, such as storing all the array elements of a particular variable 
within a single memory plane. For most of the computational procedures out- 
lined in this analysis, use of this simpler allocation procedure would not 
affect significant changes in the RALU pipeline configurations, and only a 
slight degradation in the operation rates would be realized. However, the 
changes which would be required for implementing the ZEBRA 1 iterative 
solution of the Helmholtz and Poisson equations would result in about a 30% 
decrease in the operation rate. Conversely, allocation procedures for which 
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more efficient RALU pipelines can be configured, could have been utilized. 
However, the modest improvement in the operation rates were deemed insuffi- 
cient to justify the significant increase in complexity for implementing the 
procedures • 

Utilization of the coarse grain parallelism involves the distribution of 
the computations over multiple Nodes. For the finite difference algorithm 
considered here, in which fairly simple iterative methods and a Cartesian grid 
with grid stretching in only one coordinate direction are utilized, the compu- 
tational grid is mapped into either a two or three-dimensional lattice of 
Nodes, and the Nodes perform nearly indentical processes. Furthermore, for a 
given computational procedure, less than 2% of the data need be communicated 
between the Nodes. By explicitly defining this boundary-point data and pre- 
communicating the data before it is required in the destination Nodes, most of 
the internode communication overhead for implementing this algorithm has been 
eliminated. 

'The projected timing results for implementing this algorithm on the NSC 
indicate that operation rates at around 75% of the machine peak speed are 
attainable. For a 128 Node NSC, the projected operation rates would be in 
excess of 42 GFLOPS. However, the timing results also indicate that for 
computational grids of the size which can be accommodated on the NSC, the 
convergence rates for the Red-Black SOR and ZEBRA 1 algorithms are inadequate 
for performing this simulation in a reasonable amount of time. In fact, the 
convergence rates for any of the single grid iterative methods are likely to 
be inadequate. Consequently, a more desirable approach appears to be the use 
of conjugate gradient or multigrid methods in conjunction with the iterative 


techniques 
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STEP 
# In 

Procedure 

1st GRID 
POINT 
OF 

VECTOR 

Memory Planes In 

Which Values Are Stored 


OPERANDS 

RESULTS 

u Values 

v Values 

w Values 

n ijk 


F ijk 

1 


1 * 3 , 4, 0, 6 

8 

2 

9 

1 1 

1 2 

3 

(1,2,1 ) 

2 * 0 r 5 * 1 *7 

9 

3 

10 

12 

13 

5 

(1,3,1 ) 

3*1 *6/2/4 

10 

0 

1 1 

13 

14 

7 

(1,4,1) 

0,2, 7, 3, 5 

11 

1 

12 

14 

15 

9 

(1,1,2) 

5, 7, 2, 4,0 

12 

6 

13 

15 

8 

11 

(1,2,2) 

6, 4, 3, 5,1 

13 

7 

14 

8 

9 

I 

13 

(1,3,2) 

7, 5, 0,6, 2 

14 

4 

15 

9 


15 

(1,4,2) 

4, 6, 1,7, 3 

15 

5 

8 

10 



TABLE 1 : Sequence of calculations for the first step in 

the advection term calculation procedure. 
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STEP 


1st GRID 


Memory Planes In Which Values Are Stored 


# In 

Procedure 

POINT 

OF 

VECTOR 

OPERANDS 

RESULTS 

u ijk 

v Values 

w Values 

^ijk 

^k 

G ijk 

■ 

? 

(i 

,1,1 ) 

0 

12,8,14 

3,2,1 

9 

1 1 

4 

13 

4 

(i 

,2,1 ) 

1 

13,9,15 

0,3,2 

10 

12 

5 

14 

6 

(i 

,3,1 ) 

2 

14,10,12 

1*0*3 

1 1 

13 

6 

15 

8 

(i 

,4,1) 

3 

15,11,13 

2,1,0 

12 

14 

7 

8 

10 

M 

(i 

* 1 * 2 ) 

4 

10,12,8 

7,6,5 

13 

15 

0 

9 

12 

; i 

(i 

9 2 * 2 ) 

5 

11*13,9 

4,7,6 

14 

8 

1 


14 

(i 

*3,2) 

6 

12,14,10 

5,4,7 

15 

9 

2 


16 

(i 

*4,2) 

7 

13,15,11 

6,5,4 

8 

10 

3 



TABLE 2: Sequence of calculations for the second step in 

the advection term calculation procedure. 
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STEP # 

In The 
Procedure 

— 
1st GRID 
POINT 
OF 

VECTOR 

Memory Planes In Which Values 

Are Stored 

OPERANDS 

RESULTS 

u ijk 

u 0 6 neighboring points 

— 

L i jk 

1 

Ori,n 

0 

0,1 ,3,4,6 

12 

8 

2 

(1,2,1) 

1 

1 , 2 , 0, 5, 7 

13 

9 

3 

(1,3,1) 

2 

2,3,1 ,6,4 

14 

10 

4 

(1,4,1) 

3 

3,0, 2,7,5 

15 

1 1 

5 

(1,1,2) 

4 

4, 5,7, 2,0 

8 

12 

6 

(1,2,2) 

5 

5, 6,4, 3,1 

9 

13 

7 

(1,3,2) 

6 

i 

6,7, 5, 0,2 

10 

14 

8 

1 

(1 ,4,2) 

7 

7, 4, 6, 1,3 

1 1 

15 


TABLE 3: Sequence of claculations for the right-hand side of the 

Helmholtz equation for the x velocity component. 















1st GRID 
POINT 
OF 



Memory Planes In Which 

Values Are Stored 



OPERANDS 

RESULTS 

OPERANDS 

RESULTS 

OPERANDS 

RESULTS 

VECTOR 

p values 

u* . , 
ijk 

n+1/2 
u. .. 

P Values 

V* 

ink 

n+1/2 
v. ' 
ink 

P Values 

w* 

ink 

n+1 /2 
w. 
i]k 

(1,1,1) 

10 

0 

15 

11,9 

8 

7 

<N 

•» 

o 

TT 

2 

13 

(1,2,1) 

11 

1 

14 

00 

o 

9 

6 

15,11,13 

3 

12 

(1,3,1 ) 

8 


13 

9,11 

10 

5 

12,8,14 

0 

15 

(1,4,1) 

9 


12 

10,8 

11 

4 

13,9,15 

1 

14 

(1,1,2) 

14 

4 

1 1 

15,13 

12 

3 

00 

o 

6 

9 

(1,2,2) 

15 

5 

10 

to 

4^ 

13 

2 

9,15,11 

7 

8 

(1,3,2) 

12 

6 

9 

13,15 

14 

1 

o 

to 

00 

4 

1 1 

(1,4,2) 

13 

7 

8 

14,12 

15 

0 

11 ,13,9 

5 

10 


TABLE 5: Sequence of calculations for the velocity correction procedure. 
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Memory Planes In Which Values Are Stored 


1st GRID 


POINT 

OF 

VECTOR 

OPERANDS 

RESULTS 

HP to MP 

transfer 

m 
u. 
il * 

u @ 6 neighboring points 

L ijk 

m+1 

u ijk 

From 

To 

) 

0 

0,1, 3, 4, 6 

8 

15 

9 

5 

(2,2,1) 

1 

1 , 2 , 0 , 5 ,! 

9 

14 

8 

4 

(1,3,1) 

2 

2,3,1 ,6,4 

10 

13 

15 

0 

(2,4,1) 

3 

3 , 0, 2 , 5 ,7 

11 

12 

14 

1 

(1,4,2) 

7 

7, 4, 6, 1,3 

15 

1 1 

13 

2 

(2,3,2) 

6 

6, 7, 5, 0,2 

14 

10 

12 

3 

(1,2,2) 

5 

5, 6, 4, 3,1 

13 

9 

1 1 

7 

(2,1 ,2) 

4 

4,5,7 ,2,0 

12 

8 

10 

6 

(2,1,1) 

0 

0,1 ,3,4,6 

8 

15 

9 

5 

(1,2,1) 

1 1 

1 ,2, 0,5, 7 

9 

14 

8 

4 

(2,3,1) 

2 ! 

2,3,1 ,6,4 

10 

13 

15 

0 

(1,4,1) 

3 

3, 0,2, 5, 7 

11 

12 

14 

1 

(2,4,2) 

7 

7,4,6, 1 ,3 

15 

1 1 

13 

2 

(1,3,2) 

6 

6, 7, 5, 0,2 

14 

10 

12 

3 

(2,2,2) 

5 

5, 6, 4, 3,1 

1 3 

9 

1 1 

7 

(1,1,2) 

4 

4,5,7, 2,0 

12 

8 

10 

6 


TABLE 6: Sequence of calculations for the Red-Black SOR solution 

of the Helmholtz equation for the x velocity component. 
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Memory Planes In Which Values Are Stored 


1st GRID 


fUXNT 

OF 

VECTOR 

OPERANDS 

RESULTS 

MP to MP 

transfer 

u m . v 

i]k 

u @ neighboring points 

L ijk 

®k-i 

F k-i 

% 

F k 

From 

To 

(1,1,k) 

0 

0,1 ,3,4,6 

8 

15 

12 

9 

10 

0 

5 

(1,1, k+1 ) 

4 

4, 5, 7, 2,0 

12 

9 

10 

13 

14 

4 

3 

(1 ,1 ,k+2 ) 

2 

2,3,1 ,6,4 

10 

13 

14 

11 

8 

2 

7 

(1,1, k+3) 

6 

6, 7, 5, 0,2 

14 

11 

8 

15 

12 

6 

1 

(1,1 ,k+4) 

0 

0,1, 3, 4, 6 

8 

15 

12 

9 

10 

0 

5 

(1,1, k+5) 

4 

4, 5, 7, 2,0 

12 

9 

10 

13 

14 

4 

3 

(1,1 ,k+6) 

2 

2,3,1 ,6,4 

10 

13 

14 

11 

8 

2 

7 

(1,1, k+7) 

6 

6, 7, 5, 0,2 

14 

11 

8 

15 

12 

6 

1 

(1 ,1 ,k+8) 

0 

0,1, 3, 4, 6 

8 

15 

12 

9 

10 

0 

5 

(1,1, k+9) 

4 

4, 5, 7, 2,0 

12 

9 

10 

13 

14 

4 

3 

(1,1 ,k+1 0 ) 

2 

2, 3, 1,6, 4 

10 

13 

14 

11 

8 

2 

7 

(1 ,1,k+11 ) 

6 

6, 7, 5, 0,2 

14 

11 

8 

15 

12 

6 

1 


TABLE 7: Sequence of calculations for E and P in the ZEBRA 1 

solution of the Helmholtz equation for the x velocity 
component. 
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Memory Planes In Which Values Are Stored 



OPERANDS 

RESULTS 




u, E, &F @ k+2 

u,E,&F @ k+1 

U,E,&F @ k 


m+1 

U ijk+2 

m+1 

U ijk+1 

m+1 

u ijk 

k+9 

1 ,15,12 

7,11,8 

3,13,14 

6 

2 

4 

k+6 

5,9,10 

1,15,12 

7,11 ,8 

0 

6 

2 

k+3 

3,13,14 

5,9,10 

1,15,12 

4 

0 

6 

k 

7,11 ,8 

3,13,14 

5,9,10 

2 

4 

0 


TABLE 8: Sequence of calculations for updating u in the ZEBRA 1 solution 

of the Helmholtz equation for the x velocity component. 


Projected Timing Results for a True Spatially Developing Flow 


Computational 


Iterative Method 

Procedure 

Parameter 

Red -Black SOR 

ZEBRA 1 


Subdomain Dimensions 

420x420x420 

210x210x960 

Explicit 

Computation Time (sec.) 

29.17 

29.24 

Terms 

Operation Rate (MPLOPS) 

296.5 

295.3 


Nominal Operation Rate 

| 

296.7 

296.7 

i 

Helmholtz and 

Computation Time (sec.) 

1 .06+4.23(p r +3h r ) 

5.70(p z +3h z ) 

Poisson Bguation 

Operation Rate 

334.9 

! 

367.9 

Solvers 

Nominal Operation Rate 

335.0 

i 

371 .2 


Total Computation Time 

30 . 23+4.23 (p r +3h r ) 

29 . 24+5 . 70 ( p z +3h z ) 


TABLE 9: Projected Timing Results for a True Spatially Develop- 

ing Flow on a 3360x1680x960 Computational Grid. 
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Figure 1.- Overall layout of the Navier-Stokes computer. 





512 Mword/node 
16 Memory planes 
32 Mwords/plane 



Reconfigurable ALU pipeline 
480 Mflops 


Figure 2.- 


Layout of memory /MASNET/RALU interconnects 



1-1 Jk i+IJk Ij-lk ii+lk ijk-1 ijk+1 


Const. 


Const. 



Logic 
un it | 

I Microinterrupt on 
convergence 

Block diagram of pipeline for performing a point 
Jacobi iteration of the 3-D Poisson equation. The 
operation performed by each floating-point 
processing element is indicated by +, - , or x. 
Constant parameter latches are only indicated for 
elements in which the constant is utilized for this 
procedure . 
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Figure 6 


Schematic of the memory plane allocation procedure 
for the primitive variable u. 


Plane k + 1 Plane 





































































Figure 7 


Schematic of the memory plane allocation procedure 
for the primitive variable v. 



























































Figure 9 


Schematic of the memory plane allocation procedure 
for the primitive variable P. 
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Figure 11.- Block diagram of pipeline for the second step in the 
advection term calculation procedure. 







































Memory planes (operands) 
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Figure 13.— Block diagram of pipeline for calculation of the 
right-hand side of the Poisson equation. 























Memory planes (operands) 
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Figure 14.- Block diagram of pipeline for the velocity 







Figure 15 


Block diagram of pipeline for the red-black SOR 
iteration procedure. 
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Figure 16 


Block diagram of pipeline for the first step in the 
ZEBRAl iteration procedure. 
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